ABSTRACT

This paper is primarily expository in nature and focuses on the all-pervasive
importance of f'/f in efficient estimation of location, with primary emphasis
on the role of f'/f in robust estimation. Connections between M estimators
(maximum likelihood-like), R (rank) estimators and L estimators (linear
combinations of order statistics) are discussed, and an alternative heuristic
explanation of f'/f is given showing why it is an intuitively reasonable
quantity on which to base estimation. The asymptotic relative efficiency of
each class of estimators is shown to be the square of a correlation coefficient
related to f'/f, and reasons are given as to why R estimators might often prove
to have superior robustness properties relative to L and M estimators.
SIGNIFICANCE AND EXPLANATION


There  is  considerable  discussion  in  statistics  as  to  how  one  should 
estimate  the  location  (sometimes  called  the  central  tendency)  of  a 
distribution.  Traditionally  the  sample  mean  and  its  generalization,  least 
squares,  have  been  used,  often  in  conjunction  with  outlier  rejection  rules. 
However,  there  may  be  considerable  loss  in  efficiency  if  the  mean  or  any  other 
preselected  estimator  is  used  with  data  for  which  it  is  not  appropriate. 

This  paper  provides  insight  into  which  characteristics  of  the  parent 
distribution  of  the  data  have  a  practical  impact  on  the  efficiency  of  the 
estimator.  Three  classes  of  estimators  are  explored  and  it  is  shown  that  in 
all  three  the  key  quantity  is  f'/f  where  f  is  the  density  function  of  the 
data and f' is its derivative. Correlation coefficients between the f'/f
of the hypothesized data and the corresponding quantity g'/g for the actual
data are shown to be directly related to the efficiency of each method of
THE UBIQUITOUS ROLE OF f'/f IN EFFICIENCY ROBUST ESTIMATION OF LOCATION


Brian  L.  Joiner*  and  David  L.  Hall** 


1. Introduction

Three  major  classes  of  estimators,  L,  M  and  R  estimators,  have  been  extensively 
studied  in  the  robustness  context  but  relatively  little  emphasis  has  been  placed  on  the 
similarities  and  differences  among  the  three  classes.  An  important  purpose  of  this  paper 
is  to  demonstrate  some  of  their  underlying  similarities,  and  in  so  doing,  gain  insight  as 
to  some  of  their  more  important  distinctions.  The  key  role  of  f'/f  in  these  matters  is 
emphasized. 


*Brian Joiner is Professor and Director of Statistical Laboratory, Department
of  Statistics,  University  of  Wisconsin-Madison. 

**David  Hall  is  Senior  Research  Scientist,  Battelle  Northwest  Laboratories, 
Richland,  Washington. 
2. L, M, and R estimators


In this section we give brief definitions of the three major classes of location
estimators and in subsequent sections we describe the relationships among these three
classes in large samples.

L-estimators 

Of the three classes, the simplest to explain are the L-estimators. An L-estimator of
a location parameter $\lambda$ is of the general form

$$\hat{\lambda} = \sum_{i=1}^{n} a_{i,n} X_{(i)}$$

where the $X_{(i)}$ are the ordered observations from a sample of size n and the $a_{i,n}$
are weights to be applied to the various order statistics. A simple example of an L-estimator
for a sample of size 4 is

A  =  6X(1)  +  6X<2)  +  6  X ( 3 )  +  6X(4)  ' 

In small samples from known distributions the optimal weights for L-estimates can be
derived from the expectations and variance-covariance matrix of the order statistics by
means of the Gauss-Markov theorem. For large samples it is convenient to represent the
weights by defining a function h(u) on (0,1) such that

$$a_{i,n} = h\!\left(\frac{i}{n+1}\right) \Big/ \sum_{j=1}^{n} h\!\left(\frac{j}{n+1}\right).$$

If the data have cdf $F_\lambda(x) = F(x - \lambda)$ with density $f(x - \lambda)$ and if
$f'(x - \lambda) \stackrel{\text{def}}{=} \partial f(x - \lambda)/\partial x$, then under
regularity conditions it can be shown that the asymptotically most efficient function h(u)
for data from F is given by

$$h(u) = g\left(F^{-1}(u)\right), \tag{2b}$$

where $g(x) = -\frac{d}{dx}\left(\frac{f'}{f}(x)\right)$, and $F^{-1}(u)$ is the percent point
function or inverse cdf. Some examples of optimal h functions are given in Exhibit 2A. The
optimal L-estimator for Gaussian data is the ordinary sample mean and that for double
exponential data is the median. Trimmed means are optimal L-estimators for distributions with
Gaussian middles and double exponential tails.
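
As an illustration of the weight-function form of an L-estimator, the following minimal Python sketch computes a weighted sum of order statistics with weights proportional to h(i/(n+1)). The function names (`l_estimate`, `h_mean`, `h_trimmed`), the trimming fraction and the sample values are illustrative assumptions, not taken from the paper.

```python
def l_estimate(data, h):
    """Weighted sum of order statistics with weights h(i/(n+1)) / sum_j h(j/(n+1))."""
    xs = sorted(data)                      # order statistics X_(1) <= ... <= X_(n)
    n = len(xs)
    raw = [h(i / (n + 1)) for i in range(1, n + 1)]
    return sum(w * x for w, x in zip(raw, xs)) / sum(raw)


def h_mean(u):
    """Constant weight function: every order statistic counts equally."""
    return 1.0


def h_trimmed(u, alpha=0.25):
    """Weight 1 in the middle, 0 in each tail (alpha is a hypothetical trimming fraction)."""
    return 1.0 if alpha < u < 1 - alpha else 0.0


# Hypothetical sample with one large outlier.
sample = [2.1, -0.4, 0.3, 9.7, 1.2, 0.8, -1.5]
mean_est = l_estimate(sample, h_mean)      # equals the ordinary sample mean
trim_est = l_estimate(sample, h_trimmed)   # averages only the central order statistics
print(mean_est, trim_est)
```

With the constant weight function the sketch reproduces the sample mean; with the trimmed weight function the outlying value receives zero weight.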

Exhibit 2A. Optimal weight functions for L-estimators for several distributions.

M-estimators

The concept of M-estimators (or maximum-likelihood-like estimators) was introduced by
Huber (1964). In general an M-estimator is defined by a function $\psi(x)$, and the M-estimate
of the location parameter $\lambda$ based upon the data $\{x_i\}$ is given by the value of
$\lambda$ such that

$$\sum_{i=1}^{n} \psi(x_i - \lambda) = 0.$$

The asymptotically most efficient M estimator for data from a differentiable density f is the
maximum likelihood estimator, for which $\psi = -f'/f$.

Some examples of optimal $\psi$ functions are given in Exhibit 2B. Note that for the
Gaussian distribution $\psi(z) = z$ and the optimal M estimator is the sample mean, while for
the double exponential distribution the best M-estimator is the median. For a distribution
with a Gaussian middle and double exponential tails, the maximum likelihood estimator is a
metrically trimmed mean in which $\hat{\lambda}$ must be calculated iteratively but winds up
being the average of the middle observations after all observations such that
$|X_i - \hat{\lambda}| > k$ are trimmed. The points $\lambda \pm k$ are those at which the
Gaussian portion of the parent distribution meets the double exponential portions. This
metrically trimmed mean is often called a "Huber estimator" since it was found by Huber (1964)
to be the minimax estimator for data from a Gaussian distribution with arbitrary symmetric
contamination. That is, it is the M-estimator whose worst case variance is minimized over the
class of distributions given by $\{(1 - \epsilon)\Phi + \epsilon H\}$, where $\Phi$ is standard
Gaussian and H is symmetric about zero but otherwise arbitrary.
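
The iterative solution of an M-estimating equation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names `huber_psi` and `m_estimate`, the cutoff k = 1.5, the bisection solver and the sample are all hypothetical choices, and the solver assumes an odd, nondecreasing $\psi$.

```python
def huber_psi(z, k=1.5):
    """A Huber-type psi: linear in the middle, constant beyond +/- k (k is assumed)."""
    return max(-k, min(k, z))


def m_estimate(data, psi, tol=1e-10):
    """Solve sum(psi(x - lam)) = 0 for lam by bisection.

    Assumes psi is odd and nondecreasing, so the sum is nonincreasing in lam
    and changes sign between min(data) and max(data).
    """
    lo, hi = min(data), max(data)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(x - mid) for x in data) > 0:
            lo = mid   # net pull is still to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical sample with one large outlier.
sample = [2.1, -0.4, 0.3, 9.7, 1.2, 0.8, -1.5]
gauss_est = m_estimate(sample, lambda z: z)   # psi(z) = z recovers the sample mean
huber_est = m_estimate(sample, huber_psi)     # the outlier's pull is bounded by k
print(gauss_est, huber_est)
```

For this sample the Huber-type estimate sits well below the mean because the observation at 9.7 contributes no more than k to the balance.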

R-estimators

A class of estimators based on rank tests for symmetry and known as R-estimators was
introduced by Hodges and Lehmann (1963). We find these more difficult to explain, but three
statements that provide concise and easily understood intuitive definitions for some are:

Take as an estimator that value of $\lambda$ for which the rank test scores for the n
values $(x_1 - \lambda), (x_2 - \lambda), \ldots, (x_n - \lambda)$ give the best balance
relative to the origin (slightly paraphrased from Lehmann, 1975, p. 176);

An R-estimate of $\lambda$ is that point of symmetry that is least rejectable by the
specified rank test;

An R-estimate for $\lambda$ is the midpoint of symmetric confidence intervals for $\lambda$ based
Exhibit 2B. Optimal $\psi$ functions of M-estimators for several distributions.



Proceeding more formally, consider any rank test for symmetry with score function
J(u) defined on 0 < u < 1 with J(u) = -J(1 - u). Then for any trial value of $\lambda$,

(a) compute $\{|x_i - \lambda|\}$,

(b) find the ranks $R^+(|x_i - \lambda|)$ of $\{|x_i - \lambda|\}$,

(c) compute

$$S^+(\lambda) = \sum_{i=1}^{n} \operatorname{sgn}(x_i - \lambda) \cdot J^+\!\left(\frac{R^+(|x_i - \lambda|)}{n+1}\right), \tag{2c}$$

where $J^+(u) = J\left(\tfrac{1}{2} + \tfrac{1}{2}u\right)$, and

sgn(z) = +1, if z > 0,
       =  0, if z = 0,
       = -1, if z < 0.

Then the value of $\lambda$ such that $S^+(\lambda) = 0$ is the R-estimate corresponding to
the score function J(u). If there is no value of $\lambda$ such that $S^+(\lambda) = 0$, the
R estimate is usually defined as the mid-point of the interval between the largest value of
$\lambda$ such that $S^+(\lambda) > 0$ and the smallest value such that $S^+(\lambda) < 0$
(for nonnegative $J^+$, $S^+(\lambda)$ is nonincreasing in $\lambda$).
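
Steps (a) to (c) can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the function names and the sample are hypothetical, ties among the absolute deviations are ignored, and the rank is divided by n + 1 as an assumed convention. The two score functions used are $J^+(u) = u$ (from the Wilcoxon score J(u) = 2u - 1) and $J^+(u) = 1$ (from the sign-test score).

```python
def sgn(z):
    """Sign function: +1, 0 or -1."""
    return (z > 0) - (z < 0)


def s_plus(data, lam, j_plus):
    """Signed-rank statistic S+(lam) for a trial location lam.

    j_plus is the folded score function J+(u) = J(1/2 + u/2).
    Ties among the absolute deviations are not handled (an assumption).
    """
    devs = [x - lam for x in data]
    n = len(devs)
    # indices ordered by absolute deviation; position in this list gives the rank
    order = sorted(range(n), key=lambda i: abs(devs[i]))
    total = 0.0
    for rank, i in enumerate(order, start=1):
        total += sgn(devs[i]) * j_plus(rank / (n + 1))
    return total


def wilcoxon_plus(u):
    return u        # J(u) = 2u - 1 gives J+(u) = u


def sign_plus(u):
    return 1.0      # J(u) = sgn(u - 1/2) gives J+(u) = 1


sample = [2.1, -0.4, 0.3, 9.7, 1.2, 0.8, -1.5]   # hypothetical data
for lam in (0.0, 0.8, 2.0):
    print(lam, s_plus(sample, lam, wilcoxon_plus), s_plus(sample, lam, sign_plus))
```

Scanning trial values of $\lambda$ and locating the sign change of the statistic gives the corresponding R-estimate; with the sign-test scores the statistic is zero at the sample median.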

The optimal score function for data from the distribution F is, under some
regularity conditions, given by

$$J(u) = -\frac{f'}{f}\left(F^{-1}(u)\right). \tag{2d}$$

Some examples of optimal score functions and the corresponding rank tests and R-estimators
are given in Exhibit 2C. Most R-estimators must be solved iteratively, just like
M-estimators. Two important exceptions are the optimal R-estimators corresponding to the
double exponential and logistic distributions. For the double exponential distribution the
optimal rank test is the sign test and the corresponding optimal R-estimator is the
median. For the logistic distribution the Wilcoxon test is optimal as is its counterpart,
the Hodges-Lehmann (1963) estimator, defined as the median of the Walsh averages
$(X_i + X_j)/2$ for $i \le j$.

The optimal R-estimator for the normal distribution must be solved iteratively and
corresponds to the normal scores test with $J(u) = \Phi^{-1}(u)$, the inverse normal CDF.
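
The Hodges-Lehmann estimator is one of the few R-estimators with a closed form, and a short Python sketch makes the definition concrete. The function name `hodges_lehmann` and the example data are illustrative assumptions; the sketch includes the pairs with i = j among the Walsh averages, which is one common convention.

```python
from statistics import median


def hodges_lehmann(data):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs with i <= j."""
    n = len(data)
    walsh = [(data[i] + data[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


print(hodges_lehmann([1.0, 2.0, 3.0]))          # symmetric data: the center
print(hodges_lehmann([0.0, 1.0, 2.0, 100.0]))   # one wild value barely moves it
```

The second call shows the resistance of the estimator: the sample mean of those four values is 25.75, while the median of the Walsh averages stays near the bulk of the data.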


1 


est  i  mat  or  ft 


In this section we emphasize the very close connections between optimal L, M and
R-estimators and show how the L-M-estimators of Rivest (1978) are a natural extension. In
all cases we will assume the data come from some symmetric density f known up to a
location parameter $\lambda$, and appropriate regularity conditions on f will be assumed to
hold.

Optimal M-estimators

As mentioned in the preceding section, and as is well known, the optimal M-estimator for
data from f is given by the maximum likelihood estimator $\hat{\lambda}_M$ defined by

$$\sum_{i=1}^{n} \frac{f'}{f}\left(x_i - \hat{\lambda}_M\right) = 0.$$


Optimal R-estimators


The optimal R-estimator given by (2c) and (2d) can be shown to be quite similar to the
optimal M-estimator, except that the actual deviations $\{(x_i - \lambda)\}$ in the
M-estimator are replaced by "predicted deviations"; that is, by the deviations one would
predict based on knowledge of f and the ranks of the absolute values of the deviations.
Suppose we define

$$(X_i - \lambda)_P = \operatorname{sgn}(X_i - \lambda) \cdot F_+^{-1}\!\left(\frac{R^+(|X_i - \lambda|)}{n+1}\right), \tag{3a}$$

where $F_+$ is the cdf of $|X - \lambda|$ for the given f, and $F_+^{-1}$ is its percent point
function. Then the optimal rank estimate $\hat{\lambda}_R$ is simply the solution $\lambda$ of

$$\sum_{i=1}^{n} \frac{f'}{f}\left((X_i - \lambda)_P\right) = 0.$$

Thus an M-estimator makes use of the actual deviation while a rank estimator must replace
the actual deviation by some function of its rank. Since symmetry has been assumed, the
best value to use is the value one would predict based solely on knowledge of the rank of
the absolute value of the deviation. This is an estimate of how big the absolute value of
the deviation "should have been" for data from the known f. In formula (3a) the sgn
function keeps track of which side of $\lambda$ the deviation came from, while the other
factor gives the size of the deviation one would predict.


Optimal L-estimators


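For L-estimators a somewhat analogous result can be shown to hold, except here the
actual deviation is used just as it was for M-estimators, while f'/f is approximated by a
locally linear function. To see this suppose we start from the M-estimator
$\sum \frac{f'}{f}(X_{(i)} - \lambda) = 0$, evaluated now for the ordered observations
$\{X_{(i)}\}$. Taking some particular deviation, say $(X_{(i)} - \lambda)$, we seek a linear
Taylor series approximation for the function $\frac{f'}{f}(X_{(i)} - \lambda)$. Expanding about
the value $(X_{(i)} - \lambda)^0$, let

$$\frac{f'}{f}\left(X_{(i)} - \lambda\right) \approx \frac{f'}{f}\left((X_{(i)} - \lambda)^0\right) + \left[(X_{(i)} - \lambda) - (X_{(i)} - \lambda)^0\right] \left.\frac{\partial (f'/f)(X - \lambda)}{\partial (X - \lambda)}\right|_{(X_{(i)} - \lambda)^0}.$$

Now choose

$$(X_{(i)} - \lambda)^0 = F^{-1}\!\left(\frac{i}{n+1}\right)$$

as the value about which the Taylor series expansion is taken. Then

$$\frac{f'}{f}\left(X_{(i)} - \lambda\right) \approx \frac{f'}{f}\left(F^{-1}\!\left(\tfrac{i}{n+1}\right)\right) - (X_{(i)} - \lambda)\, h\!\left(\tfrac{i}{n+1}\right) + F^{-1}\!\left(\tfrac{i}{n+1}\right) h\!\left(\tfrac{i}{n+1}\right),$$

where $h(u) = -\frac{\partial (f'/f)}{\partial x}\left(F^{-1}(u)\right)$ as given by (2b). But
$\sum \frac{f'}{f}\left(F^{-1}(\tfrac{i}{n+1})\right) = 0$ since this is just the MLE of the
center of a distribution symmetric about zero with "data" equal to the symmetric quantiles.
Similarly $\sum h(\tfrac{i}{n+1}) \cdot F^{-1}(\tfrac{i}{n+1}) = 0$ since this is simply an
L-estimate of the same center for the same quantile "data". Thus

$$\sum_{i=1}^{n} \frac{f'}{f}\left(X_{(i)} - \lambda\right) \approx -\sum_{i=1}^{n} h\!\left(\tfrac{i}{n+1}\right) \left(X_{(i)} - \lambda\right).$$

When the right hand half of this is set equal to zero, it yields the L-estimator

$$\hat{\lambda} = \sum_{i=1}^{n} h\!\left(\tfrac{i}{n+1}\right) X_{(i)} \Big/ \sum_{j=1}^{n} h\!\left(\tfrac{j}{n+1}\right).$$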


This process can be seen more clearly in Exhibit 3A. There the curved line is the $\psi$
function that is asymptotically optimal for a Tukey lambda variate with parameter -0.5.*

* A Tukey lambda variate Z with parameter $\gamma$ is defined by the equation
$Z = [U^{\gamma} - (1 - U)^{\gamma}]/\gamma$ where U is uniform on (0,1). See, e.g., Joiner and
Rosenblatt (1971).

Exhibit 3A. Relationship between asymptotically equivalent M and L estimators.


The  circles  on  the  horizontal  axis  are  at  the  points  F  1  for  n  =  *>  observations 

and a trial value of $\lambda$. Tangent lines are drawn to $\psi$ at these points. These
tangent lines are the small sample analog of those that define the asymptotically equivalent
L-estimator. The x's on the horizontal axis denote the observed data. For the M-estimator
one computes $\sum \psi$ at the data, moving the $\psi$ curve to the left or right until
$\sum \psi = 0$. The center of the $\psi$ curve then gives the M estimator $\hat{\lambda}_M$.

For the L-estimator the process is conceptually analogous except that the tangent line
approximations are used rather than $\psi$ itself. The smallest value of x uses the first
tangent line, the second smallest uses the second tangent line and so on. Even in this
small sample it is clear that there is little difference between the $\psi$ weights and the
weights from the tangent lines. Note also that the tangent lines are not used in the same
fashion as the customary piecewise linear approximation. The smallest data value uses the
first tangent line no matter how far out (or in) that point might fall.

To sum up, when one knows the parent distribution there is a very close connection
among the asymptotically optimal L, M and R-estimators. The M-estimator is the maximum
likelihood estimator, with $\psi$ proportional to f'/f, and the L and R-estimators are defined
by simple approximations. This close connection warrants summarization.

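Optimal M-estimator for f is value of $\lambda$ such that

$$\sum_i \frac{f'}{f}(X_i - \lambda) = 0;$$

Optimal R-estimator for f is value of $\lambda$ such that

$$\sum_i \frac{f'}{f}\big(\text{predicted value of } (X_i - \lambda) \mid f \text{ and rank of } |X_i - \lambda|\big) = 0;$$

Optimal L-estimator for f is value of $\lambda$ such that

$$\sum_i \Big[\text{linear approximation of } \frac{f'}{f}\Big](X_{(i)} - \lambda) = 0.$$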

L-M  estimators 

Seeing  the  intimate  connection  among  these  estimators  leads  one  to  think  of  broader 
classes  of  estimators  that  would  combine  or  include  these  three.  The  work  of  Rivest  (1978) 
provides one such class. Rivest has studied a class of L-M estimators defined as the
solution of

$$\sum_{i=1}^{n} h\!\left(\frac{i}{n+1}\right) \psi\left(X_{(i)} - \lambda\right) = 0.$$

These estimators would seem to combine features of both L and M-estimators. As one might
conjecture, these estimators turn out to be asymptotically equivalent to maximum likelihood
estimators at F if h and $\psi$ are such that

$$h(u) \cdot \frac{d\psi}{dx}\left(F^{-1}(u)\right) = -\frac{d(f'/f)}{dx}\left(F^{-1}(u)\right).$$

That is, the product of h, the L component weight function, and the slope of $\psi$, the M
component function, must be identical to the derivative of the maximum likelihood score
function $-f'/f$. Thus an asymptotically optimal L-M-estimator with h and $\psi$ functions
defined by

$$h(u) = \frac{d\psi}{dx}\left(F^{-1}(u)\right) = \left[-\frac{d(f'/f)}{dx}\left(F^{-1}(u)\right)\right]^{1/2}$$

would, in some sense, be "half" M and "half" L.



4.  Heuristic  view  of  f'/f 


In the preceding section we saw that

$$\frac{f'}{f} = \frac{\partial f(X - \lambda)}{\partial X} \div f(X - \lambda)$$

is the key quantity in efficient estimation of location, be it M, L, or R or even
L-M-estimation. This important fact seems not to be widely appreciated even though it is
implicit in many sources. In this section we give a heuristic view as to why it is
eminently plausible to base estimates on f'/f. This intuitive motivation is intended to
complement that of the likelihood approach.

In the likelihood approach one starts from the fact that the "probability" of the data
for any given value of $\lambda$ is $\prod_{i=1}^{n} f(X_i - \lambda)$. One then finds the
value of $\lambda$ that maximizes the "probability of the data". Taking logarithms and
differentiating leads to the familiar $\sum_{i=1}^{n} \frac{f'}{f}(X_i - \lambda) = 0$ as
defining the maximum likelihood estimate of $\lambda$. Even after seeing this, many of us
still have little "feel" as to why f'/f "should be" the defining characteristic.

Here we give an alternative view that seems to be plausible enough even for many
students taking their first course in statistics. The exposition is all in the context of
estimating the location of a symmetric distribution known up to its point of symmetry;
however, much is immediately generalizable to broader classes of estimation problems.

The process of estimation can be viewed as essentially the matching of a density with
an observed histogram. One might imagine the density function in Exhibit 4A being moved
along the horizontal axis until it provides a good match, in some sense, with the observed
data. Once the "best" match has been found, the location estimate becomes the center of
symmetry of the density function.

The role of f'/f in efficient estimation of location is thus to determine when a
good match has been obtained. To see how this is accomplished consider Exhibits 4B and 4C.
These present two microscopic views of the interrelationship between the data and the
theoretical density function at different regions of the horizontal axis. The amount of
information available locally concerning the incremental movement of the density relative
to the data is quite different at the two sites. In Exhibit 4B the local portion of the
density would fit the local data almost as well if the density were moved slightly to
either side. Thus there is very little local information that could be used to determine
the value of $\lambda$.

In Exhibit 4C the situation is quite different. In this case any movement of the
density would markedly decrease the quality of the match between the density and the data,
and thus there is much relevant local information concerning estimation of location.

Exhibit 4A. Heuristic view of estimation of location. The density is moved along until it
"matches" the data as well as possible. Matches are determined by the balancing of stresses
due to the relative steepness of the density function at various points along the horizontal
axis, i.e. by making $\sum f'/f = 0$. The center of symmetry of the density is then the
estimate of location for the data.

Exhibits 4B and 4C. Close-up views of two portions of the density function and the data it
is seeking to fit: one view (Exhibit 4B) where f'/f is essentially zero, and thus where there
is little local information about whether the density is too far left or right relative to
the data; and one (Exhibit 4C) where f'/f is large and much can be learned locally about
appropriate placement of the density function relative to the data.

The  main  difference  between  Exhibits  4B  and  4C  is  in  the  relative  steepness  of  the 
local  part  of  the  density  function,  i.e.  in  the  magnitude  of  f'/f.  However,  even  after 
considering  this,  one  might  still  question  why  f'/f  rather  than,  say  f'  alone  is  the 
key  characteristic  in  estimation  of  location.  While  the  answer  is  not  obvious  it  does  seem 
heuristically  reasonable  that  the  height  of  a  histogram  should  also  be  relevant  since  a 
given  amount  of  tilt  at  the  top  of  a  tall  histogram  may  well  be  less  informative  than  the 
same  amount  in  a  very  short  one. 

Thus, seen from this view, the role of f'/f is to measure the relative steepness of
the density function and to express the amount of resistance the data exert to having
$\lambda$ moved away from them. Once $\lambda$ is near the correct value, there will exist
stresses from f'/f on both sides of it. For any given sample, the value of $\lambda$ that
balances these stresses will be the location estimate for that particular sample.

It is interesting to review these "stress functions" for several well known
distributions. Exhibit 2B presents -f'/f for several distributions. When considered
from the above viewpoint, the fact that for the Gaussian distribution -f'/f is exactly
proportional to the size of the deviation is quite remarkable. The tails of a Gaussian
distribution thus get ever increasingly steep as one moves further away from its center, in
direct proportion to the distance. Hence for a Gaussian distribution the amount of
resistance exerted to moving $\lambda$ away from an observation increases in direct
proportion to the magnitude of the deviation. This, of course, is the cause of the well
known phenomenon that outlying observations have great effect on location estimation for M
estimators based on the Gaussian distribution.


The double exponential distribution, on the other hand, has tails of constant relative
steepness. Thus the stress exerted on $\lambda$ does not depend at all on how far out in the
tails an observation is. All that matters is what side of $\lambda$ the observation is on.

The logistic distribution differs from the double exponential in its central portion
but, as can be seen from Exhibit 2B, its tails are asymptotically equivalent to those of
the double exponential. Thus, once $\lambda$ is a substantial distance from an observation,
moving it an arbitrary amount further away makes virtually no difference in the amount of
force exerted on $\lambda$ by that observation under logistic assumptions.

The Cauchy distribution is different yet: observations at an intermediate distance
from $\lambda$ exert the most force while those further out exert almost none. Under Cauchy
assumptions, once an observation is far enough away from $\lambda$, not even the side it is
on matters much. Cauchy tails are asymptotically flat, like a uniform distribution, and thus
contain essentially no information on location. The greatest information concerning
location in a Cauchy distribution is in the "shoulders" of the density, which fall off
rather sharply.

The uniform distribution itself represents a limiting case in another direction.
Here, f'/f shows that the middle part of the data contains no information on location
while the endpoints contain "infinite" information; thus $\lambda$ must exactly balance the
two endpoints, leading to the well known result that the midrange is the MLE and
asymptotically most efficient L estimator for the uniform distribution.
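
The stress functions discussed above are easy to tabulate. The following Python sketch evaluates -f'/f for the standard (unit scale) Gaussian, double exponential, logistic and Cauchy densities; the function names and the grid of deviations are illustrative choices, and the closed forms are the usual textbook ones for these standard densities.

```python
import math


def stress_gaussian(x):
    return x                          # -f'/f grows in direct proportion to the deviation


def stress_double_exp(x):
    return (x > 0) - (x < 0)          # only the side of the observation matters


def stress_logistic(x):
    return math.tanh(x / 2)           # nearly linear in the middle, levels off at +/- 1


def stress_cauchy(x):
    return 2 * x / (1 + x * x)        # largest at |x| = 1, fades to zero far out


for x in (0.5, 1.0, 3.0, 10.0, 100.0):
    print(x, stress_gaussian(x), stress_double_exp(x),
          round(stress_logistic(x), 4), round(stress_cauchy(x), 4))
```

Reading down the columns shows the four behaviours described in the text: unbounded stress for the Gaussian, constant stress for the double exponential, stress that saturates for the logistic, and stress that redescends toward zero for the Cauchy.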

The point of view discussed above is naturally quite closely connected to the idea of
the influence curve introduced by Hampel (1968). Hampel noted that if a derivative of the
functional defining an estimator was taken, the resulting function $\Omega(x, F)$ could be
interpreted as representing how much effect an observation at x would have on the
estimator with data from F.

A  Mechanical  Analogy 

These observations lead us to propose an alternative mechanical view of estimation of
location. The conventional view is that of finding the balance point of a scale in which
blocks of equal mass have been placed at each data point. This analogy is exact for the
Gaussian case and can be modified to work in other cases by using an analogy with weighted
estimation (cf. Andrews, 1974).

An alternative view that seems to provide different insight is that of the same
balance, but with blocks of mass proportional to $|\frac{f'}{f}(X_i - \lambda)|$ being placed
at ±1, depending on the sign of $(X_i - \lambda)$. The masses represent the amount of stress
or force being exerted by the various data points as a function of the relative steepness,
f'/f, at that distance from $\lambda$. For the double exponential, all masses would be of the
same size so that $\lambda$ only seeks to have an equal number of masses on each side. For
the Cauchy, the masses would first increase then decrease in size.

Other  Uses  of  f'/f 

That f'/f plays a key role in estimation is of course not new. It is explicitly the
central quantity in maximum likelihood estimation and is at least implicit in L and
R-estimation. Fisher information, being equal to $E(f'/f)^2$, is thus a measure of the
"amount" of f'/f. The Cramer-Rao lower bound for the variance of an estimator of location,
being equal to $1/E(f'/f)^2$ (for a single observation), has an analogous interpretation.
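
As a numerical check on the reading of Fisher information as the "amount" of f'/f, the sketch below approximates $E(f'/f)^2 = \int_0^1 [f'/f(F^{-1}(u))]^2\,du$ by a midpoint rule for three standard (unit scale) densities. The function name `fisher_information`, the grid size and the choice of densities are assumptions made for illustration; the known values for these standard densities are 1 (Gaussian), 1/3 (logistic) and 1/2 (Cauchy).

```python
import math
from statistics import NormalDist


def fisher_information(stress, quantile, n_grid=20000):
    """Midpoint-rule approximation of the integral of stress(F^{-1}(u))^2 over (0, 1)."""
    total = 0.0
    for k in range(n_grid):
        u = (k + 0.5) / n_grid
        total += stress(quantile(u)) ** 2
    return total / n_grid


# -f'/f and the quantile function F^{-1} for each standard density
info_gauss = fisher_information(lambda x: x, NormalDist().inv_cdf)
info_logistic = fisher_information(lambda x: math.tanh(x / 2),
                                   lambda u: math.log(u / (1 - u)))
info_cauchy = fisher_information(lambda x: 2 * x / (1 + x * x),
                                 lambda u: math.tan(math.pi * (u - 0.5)))
print(info_gauss, info_logistic, info_cauchy)
```

The reciprocals of these numbers are the corresponding per-observation variance bounds, so under this scaling the bound is smallest for the Gaussian and largest for the logistic.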

Stein (1956), Stone (1975) and others have shown that at least asymptotically for
symmetric distributions it is possible to estimate f'/f from the data and thus gain full
asymptotic efficiency for data from any symmetric distribution, subject to mild regularity
constraints. In fact Stone's results can be said to be promising for the location
problem even in samples as small as 40.

Huber's  Minimax  Result 

A less obvious situation in which f'/f appears to be key is associated with Huber's
(1964) minimax estimator for a type of contaminated data. Huber proposed the following
problem: suppose F is a symmetric distribution known up to a location parameter and H
is any other distribution symmetric about the same point. Then consider the class of
contaminated distributions $\{(1 - \epsilon)F + \epsilon H\}$ where $\epsilon$ is fixed. He
asked, which fixed estimator has the best worst-case variance with respect to this class of
distributions?

Huber showed that if F had a strongly unimodal density, i.e. if -f'/f were
monotonic, then under mild regularity conditions the solution was given by an M estimator
of the form

$$\sum_{i=1}^{n} \psi(x_i - \lambda) = 0,$$

where

$$\psi(x) = \begin{cases} -\dfrac{f'}{f}(x), & |x| < c, \\[1ex] -\operatorname{sgn}(x)\,\dfrac{f'}{f}(c), & |x| \ge c, \end{cases}$$

and where c is determined by $\epsilon$. Huber interpreted this result as a sort of
"fattening up" of F's tails. However, it seems more pertinent to view it as the removal of
the most informative part of f'/f. For example, if F is Gaussian then -f'/f is a straight
line with positive slope. The minimax estimator for the above class is known as a "Huber" and
has the $\psi$ function shown at the bottom of Exhibit 2B.

Thus nature's best strategy is to take as H that distribution which places all of
its mass in the portions of F that have the greatest relative steepness in such a way as
to make those portions of the resulting density exponential. Hence the worst possible
Huber-type contaminated normal has a Gaussian middle and double exponential tails, and has
as its maximum likelihood estimator the M estimator defined above.

A  Conjecture 

Huber's proof makes critical use of the assumed strong unimodality of F and thus
does not apply when F is a distribution like the Cauchy. However, we conjecture that
Huber's result holds in a broader class of distributions in the sense that the Huber-type
minimax estimators for any distribution will, under reasonable regularity conditions, be of
the form

$$\psi(x) = \begin{cases} -\dfrac{f'}{f}(x), & x \in A, \\[1ex] \operatorname{sgn}(x) \cdot K, & x \notin A, \end{cases}$$

where

$$A = \left\{x : \left|\frac{f'}{f}(x)\right| \le K\right\}.$$


5. f'/f and relative efficiency of estimation

In Section 3 we observed that f'/f is the key quantity in defining a fully efficient
estimator of location. In this section we investigate the problem of relative efficiency
of estimation, where one uses an estimator optimal for data from F, but applies it to
data which actually came from some other distribution G. We show that in such cases the
asymptotic relative efficiencies (AREs) of L, M and R-estimators are all determined by
correlation coefficients between f'/f and g'/g. The difference in efficiencies among L,
M and R-estimators is shown to be a matter of the "data" at which f'/f is evaluated.
This provides us with insight as to differences among the three classes.

Efficiencies  as  Squared  Correlations 

Correlation coefficients occur frequently in efficiency calculations. Cramer (1945)
showed that if T1 was an efficient estimator and T2 was a regular unbiased estimator,
then the square of the correlation coefficient between the estimators gave the efficiency
of T2. Noether (1955), Hajek (1962), and van Eeden (1963) extended this result and showed
in different situations that the Pitman efficiency of certain tests was given by the square
of the correlation coefficient between the test statistics. In the context of rank tests
this correlation between the rank statistics reduces to the correlation between the
asymptotic score functions corresponding to the tests (Hajek, 1962). (Note that it is
often much easier to compute the correlation between score functions than it is between the
estimators which they define.)

R-Estimators 

First we show that the ARE of a rank estimator corresponding to an arbitrary score
function is given by the square of a type of correlation coefficient. Hajek (1962) showed
that when the two sample rank test based on the score function J(u) is applied to data
from a distribution F, the ARE of the test based on J with respect to the
asymptotically most powerful rank test (amprt) for the distribution F is given by the
square of a correlation coefficient, namely

$$\rho^2\!\left(J(u),\, -\frac{f'}{f}(F^{-1}(u))\right) = \frac{\left[\int_0^1 J(u)\left(-\frac{f'}{f}(F^{-1}(u))\right) du\right]^2}{\int_0^1 J^2(u)\,du \,\int_0^1 \left[-\frac{f'}{f}(F^{-1}(u))\right]^2 du}. \tag{5.1}$$

van  Eeden  (1963)  proved  a  similar  result  for  the  one  sample  test  for  symmetry.  Hajek's 
result  (5.1)  is  also  true  for  the  corresponding  rank  estimators,  as  will  now  be  shown 
directly. 

We  assume  throughout  this  section  that  all  distributions  considered  are  symmetric  and 
unimodal  with  finite  Fisher  information  and  a  differentiable  density. 

Theorem 5.1: If the R-estimator with score function J(u) is used on data from a
distribution F, the ARE of the estimator based on J(u) relative to the R-estimator
corresponding to the amprt for F is given by (5.1).

Proof: We can assume without loss of generality that F has been scaled so that its
Fisher information

$$I(F) = \int_0^1 \left[-\frac{f'}{f}(F^{-1}(u))\right]^2 du$$

is unity. The asymptotic variance of the rank estimator with score function J(u) on the
distribution F is given by

$$\sigma_R^2 = \frac{\int_{-\infty}^{\infty} J^2(F(x))\, f(x)\,dx}{\left[\int_{-\infty}^{\infty} J'(F(x))\, f^2(x)\,dx\right]^2}.$$

Now since I(F) = 1, the ARE of this rank estimator with respect to the efficient rank
estimator for F is just the reciprocal of $\sigma_R^2$. Thus

$$\mathrm{ARE}_R(J \mid F) = \frac{\left[\int_{-\infty}^{\infty} J'(F(x))\, f^2(x)\,dx\right]^2}{\int_{-\infty}^{\infty} J^2(F(x))\, f(x)\,dx} = \frac{\left[\int_0^1 J'(u)\, f(F^{-1}(u))\,du\right]^2}{\int_0^1 J^2(u)\,du},$$

which after integrating the numerator by parts and recalling that I(F) = 1 gives

$$\mathrm{ARE}_R(J \mid F) = \frac{\left[\int_0^1 J(u)\, \frac{f'}{f}(F^{-1}(u))\,du\right]^2}{\int_0^1 J^2(u)\,du \cdot \int_0^1 \left[-\frac{f'}{f}(F^{-1}(u))\right]^2 du},$$

which is the desired result.

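In the above, note that if J is the score function for the amprt corresponding to
some distribution G then the expression for the ARE becomes

$$\mathrm{ARE}_R(G \mid F) = \rho^2\!\left(-\frac{f'}{f}(F^{-1}(u)),\, -\frac{g'}{g}(G^{-1}(u))\right) = \frac{\left[\int_0^1 \frac{g'}{g}(G^{-1}(u))\, \frac{f'}{f}(F^{-1}(u))\,du\right]^2}{\int_0^1 \left[\frac{f'}{f}(F^{-1}(u))\right]^2 du \,\int_0^1 \left[\frac{g'}{g}(G^{-1}(u))\right]^2 du}. \tag{5.2}$$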

Note that this is the square of the correlation coefficient between f'/f and g'/g
with each being evaluated at its own data. Also note the reflexivity of the ARE for rank
estimators. That is, (5.2) represents the ARE of the rank estimator with score function
$-g'/g(G^{-1}(u))$ on the data from F as well as the ARE of the rank estimator with score
function $-f'/f(F^{-1}(u))$ on data from G. As an example, the best rank estimator for
Gaussian data, the normal scores estimator, has the same ARE on logistic data, 0.95, as the
best rank estimator for logistic data, the Hodges-Lehmann estimator, has on Gaussian data.

The R efficiencies for a number of other pairs of distributions are computed in Hall
and Joiner (1980b).
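The squared correlation in (5.2) is easy to evaluate numerically. The sketch below is a minimal illustration, not the computation used in the companion papers: it approximates the integrals by a midpoint rule on (0, 1), and the function names and the grid size are assumptions made for this example. The score functions are the standard closed forms for the Gaussian, logistic and Cauchy distributions, each already composed with its own quantile function.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def normal_score(u):
    """-f'/f(F^{-1}(u)) for the Gaussian: the normal quantile itself."""
    return _STD_NORMAL.inv_cdf(u)


def logistic_score(u):
    """-f'/f(F^{-1}(u)) for the logistic: tanh(x/2) at x = log(u/(1-u))."""
    return 2.0 * u - 1.0


def cauchy_score(u):
    """-f'/f(F^{-1}(u)) for the Cauchy: 2x/(1+x^2) at x = tan(pi(u - 1/2))."""
    return -math.sin(2.0 * math.pi * u)


def are_r(score_g, score_f, n=100_000):
    """Squared correlation of two mean-zero score functions, as in (5.2)."""
    s_gf = s_gg = s_ff = 0.0
    for i in range(n):
        u = (i + 0.5) / n  # midpoint rule keeps u away from 0 and 1
        g, f = score_g(u), score_f(u)
        s_gf += g * f
        s_gg += g * g
        s_ff += f * f
    return s_gf * s_gf / (s_gg * s_ff)


are_normal_logistic = are_r(normal_score, logistic_score)  # close to 3/pi
are_logistic_cauchy = are_r(logistic_score, cauchy_score)  # close to 6/pi^2
```

Swapping the two arguments of `are_r` leaves the value unchanged, which is the reflexivity property noted above.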




Correlations, Angles and Efficiencies of R-estimators

For R estimators Gastwirth (1966) has noted that the score function J(u) of a rank
estimator may be thought of as an infinite dimensional vector. The score function for the
efficient R estimator for F is given by $-f'/f(F^{-1}(u))$, which may thus also be thought
of as an infinite dimensional vector. The square of the cosine of the angle between these
two vectors is the ARE of J applied to data from F. This relationship between the ARE
of R estimators and the angles between score functions is further developed in Hall and
Joiner (1980c).

M  Estimators 

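For M estimators a similar, though not identical, result is obtained:

Theorem 5.2: The ARE of the M estimator defined by the square integrable function $\psi(x)$
on data from F with respect to the efficient M estimator for F is given by

$$\mathrm{ARE}_M(\psi \mid F) = \rho^2\!\left(\psi(F^{-1}(u)),\, -\frac{f'}{f}(F^{-1}(u))\right) = \frac{\left[\int_0^1 \psi(F^{-1}(u))\left(-\frac{f'}{f}(F^{-1}(u))\right) du\right]^2}{\int_0^1 \psi^2(F^{-1}(u))\,du \,\int_0^1 \left(-\frac{f'}{f}(F^{-1}(u))\right)^2 du}.$$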

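Proof: The ARE of the M-estimator corresponding to $\psi(x)$ with respect to the efficient M
estimator for F, which corresponds to -f'/f(x), is

$$\mathrm{ARE}_M(\psi \mid F) = \frac{\left[\int \psi'(x)\, f(x)\,dx\right]^2}{I(F) \int \psi^2(x)\, f(x)\,dx} = \frac{\left[\int \psi(x)\left(-\frac{f'}{f}(x)\right) f(x)\,dx\right]^2}{\int \psi^2(x)\, f(x)\,dx \,\int \left[\frac{f'}{f}(x)\right]^2 f(x)\,dx}, \tag{5.3}$$

where the second form follows on integrating the numerator by parts. The substitution
u = F(x) then gives the expression in the theorem,
which is the desired result. Hence the ARE of an M estimator is also given by the square
of a correlation coefficient.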

From the above theorem the ARE of the optimal M estimator (i.e. the maximum likelihood
estimator) for G when applied to data from F is given by the squared correlation
coefficient

$$\mathrm{ARE}_M(G \mid F) = \rho^2\!\left(-\frac{g'}{g}(F^{-1}(u)),\, -\frac{f'}{f}(F^{-1}(u))\right).$$

As with R-estimators, the ARE is the squared correlation coefficient between f'/f and
g'/g. The big difference here is that f'/f and g'/g are both evaluated at the actual
data.

The  Role  of  Scale  in  M  Estimation 

In M estimation the scale of the data makes an important difference in estimation of
location. For example, the ARE of the M estimator $\psi(x - \lambda)$ when applied to data from
$F\!\left(\frac{x - \lambda}{\sigma}\right)$ depends very much on the value of $\sigma$. In R and L estimation, the value of
$\sigma$ is not a factor in determining ARE. This independence of scale in L and R
estimation is a convenience, both practically and theoretically.

An illuminating example of M scale dependence is afforded by the family of scaled
t distributions. For this family the optimal $\psi$ functions all have identical shape:

$$\psi(x) = \frac{(\nu + 1)\, u}{\sigma \sqrt{\nu}\, [1 + u^2]}, \qquad \text{where } u = \frac{x}{\sqrt{\nu}\, \sigma}.$$

The roles of $\sigma$ and $\nu$ are thus totally confounded in M estimation
for the t family. One can achieve, for example, 100% efficiency with the $\psi$ for any
t, say the Cauchy, on data from any other t just by using a "wrong" value of $\sigma$.
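The confounding of scale and degrees of freedom can be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates the squared correlation of Theorem 5.2 as an integral over x against the (unnormalized) t density, using the substitution x = tan(theta) and a midpoint rule. The function names, the choice of 3 degrees of freedom and the grid size are all hypothetical.

```python
import math


def t_psi(x, nu, sigma=1.0):
    """-f'/f for a t density with nu degrees of freedom and scale sigma."""
    u = x / (sigma * math.sqrt(nu))
    return (nu + 1.0) / (sigma * math.sqrt(nu)) * u / (1.0 + u * u)


def are_m_on_t(psi, nu, n=20_000):
    """Squared correlation of psi with the t_nu score on unit-scale t_nu data.

    The density's normalizing constant cancels in the ratio, so the
    unnormalized density (1 + x^2/nu)^(-(nu+1)/2) is used.
    """
    s_ps = s_pp = s_ss = 0.0
    for i in range(n):
        theta = -math.pi / 2.0 + (i + 0.5) * math.pi / n
        x = math.tan(theta)
        # density times dx/dtheta = sec^2(theta)
        w = (1.0 + x * x / nu) ** (-(nu + 1.0) / 2.0) / math.cos(theta) ** 2
        p, s = psi(x), t_psi(x, nu)
        s_ps += p * s * w
        s_pp += p * p * w
        s_ss += s * s * w
    return s_ps * s_ps / (s_pp * s_ss)


# Cauchy psi (nu = 1) applied to t data with 3 degrees of freedom.
are_unit_scale = are_m_on_t(lambda x: t_psi(x, 1.0, sigma=1.0), nu=3.0)
are_wrong_scale = are_m_on_t(lambda x: t_psi(x, 1.0, sigma=math.sqrt(3.0)), nu=3.0)
```

With the Cauchy scale set to the square root of 3, the Cauchy $\psi$ is an exact multiple of the optimal $\psi$ for the t with 3 degrees of freedom, so `are_wrong_scale` equals 1 up to rounding, while `are_unit_scale` falls short of 1.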

Lack  of  Reflexivity  of  M  Efficiencies 

The reflexivity of efficiencies that R estimators possess is not attained by M
estimators. That is, $\mathrm{ARE}_M(F \mid G)$ is in general different from $\mathrm{ARE}_M(G \mid F)$. The amount
of difference depends in general on the scaling of the distributions. For example, in
Exhibit 5A we see that $\mathrm{ARE}_M(\text{Cauchy} \mid \text{logistic})$ is not equal to
$\mathrm{ARE}_M(\text{logistic} \mid \text{Cauchy})$ for any of the four choices of scale considered.
For R-estimators,

$$\mathrm{ARE}_R(\text{logistic} \mid \text{Cauchy}) = \mathrm{ARE}_R(\text{Cauchy} \mid \text{logistic}) = 6/\pi^2 = 60.79\%.$$

An even more extreme example of lack of reflexivity is provided by the Gaussian and
Cauchy, where

$$\mathrm{ARE}_M(\text{Gaussian} \mid \text{Cauchy}) = 0, \quad \text{for all choices of scale.}$$

On the other hand $\mathrm{ARE}_M(\text{Cauchy} \mid \text{Gaussian})$ is positive for all choices of scale, and is 57%
when the two distributions are expressed in their standard form. The $\mathrm{ARE}_R$ is 43%, either
way, no matter what scale is used.
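The lack of reflexivity for M estimators can be seen directly from the correlation formula. The sketch below is a minimal illustration, assuming both distributions are taken at unit scale (a choice made here for simplicity, which is not one of the scalings in Exhibit 5A); the function names and grid size are likewise hypothetical. Both score functions are evaluated at the quantiles of the actual data, as Theorem 5.2 requires.

```python
import math


def psi_cauchy(x):
    """-g'/g for the unit-scale Cauchy."""
    return 2.0 * x / (1.0 + x * x)


def psi_logistic(x):
    """-g'/g for the unit-scale logistic."""
    return math.tanh(x / 2.0)


def quantile_cauchy(u):
    return math.tan(math.pi * (u - 0.5))


def quantile_logistic(u):
    return math.log(u / (1.0 - u))


def are_m(psi_used, psi_optimal, quantile_data, n=100_000):
    """Squared correlation of two psi functions, both at the actual data."""
    s_ab = s_aa = s_bb = 0.0
    for i in range(n):
        x = quantile_data((i + 0.5) / n)  # midpoint rule on (0, 1)
        a, b = psi_used(x), psi_optimal(x)
        s_ab += a * b
        s_aa += a * a
        s_bb += b * b
    return s_ab * s_ab / (s_aa * s_bb)


# Cauchy MLE on logistic data, and logistic MLE on Cauchy data.
are_cauchy_on_logistic = are_m(psi_cauchy, psi_logistic, quantile_logistic)
are_logistic_on_cauchy = are_m(psi_logistic, psi_cauchy, quantile_cauchy)
```

The two values differ noticeably, with the Cauchy estimator on logistic data the more efficient of the two, whereas the corresponding rank efficiencies coincide at $6/\pi^2$ in both directions.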

L  Estimators 

The ARE of L estimators is also given by a squared correlation coefficient.

Theorem 5.3: The ARE of the L estimator with weight function h(u), where
$h(1 - u) = h(u)$ and $\int_0^1 h(u)\,du = 1$, on data from F with respect to the efficient L
estimator for F is given by

$$\mathrm{ARE}_L(h \mid F) = \rho^2\!\left[A(h,F)(u),\, -\frac{f'}{f}(F^{-1}(u))\right] \tag{5.4}$$

where

$$A(h,F)(u) = \int_{1/2}^{u} \frac{h(t)}{f(F^{-1}(t))}\,dt = \int_{1/2}^{u} h(t)\, d(F^{-1}(t)).$$


Exhibit 5A

Illustration of strong dependence of asymptotic relative efficiency
of M estimators on choice of scaling function. All but one of the scaling
functions are analogous to the median absolute deviation in that they
are a percentile of the $\{|y_i - \tilde{y}|\}$. For example, MAD = $S_{.50}$ =
[0.50 quantile of $\{|y_i - \tilde{y}|\}$]. Efficiencies for the same $\psi$ function on the
same data range from, e.g., 56% to 75% depending on the choice of scaling
function.

(a) Maximum likelihood estimator for Cauchy applied to logistic data, and
vice versa. Note that for the scaling function S_{.1}, both M estimators
have efficiencies higher than the 60.8% of their rank counterparts.

                                      Efficiencies (in %)
                                       Scaling function*
    Estimator     Applied to
    MLE for       data from      S_{.1}   S_{.50}   S_{.67}   (Info)^{.5}
    ---------     ----------     ------   -------   -------   -----------
    Cauchy        logistic        81.6     77.0      71.2        80.6
    logistic      Cauchy          61.4     57.2      52.2        60.4

* Each S was multiplied by a k such that 100% efficiency was attained by the
MLE on its own data. Thus, for example, the logistic estimator had
k = [S (on logistic data)]^{-1}.

(b) Tukey bisquare applied to Student t data. The tuning constant k
in the bisquare was, in each case, selected to produce 95%
efficiency on normal data.

                         Efficiencies (in %)
                          Scaling function
    Data from         S_{.25}    S_{.50}    S_{.75}
    -------------     -------    -------    -------
    k                  15.268      7.213      4.229
    Cauchy              75.3       70.2       56.3
    t with v = 2        90.8       89.7       86.1
           v = 3        95.7       95.3       94.1
           v = 5        98.4       98.4       98.3
           v = 10       98.7       98.8       99.0
           v = 30       97.2       97.3       97.4
    normal              95.0       95.0       95.0

The results in this Exhibit are due to Lane Bishop.


Proof: Since $A(h,F)(u) = \int_{1/2}^{u} \frac{h(t)}{f(F^{-1}(t))}\,dt$, we have $h(u) = f(F^{-1}(u))\, A'(h,F)(u)$, where
$A'(h,F)(u) = \frac{d}{du} A(h,F)(u)$. Thus, since $\int_0^1 h(u)\,du = 1$,

$$\int_0^1 f(F^{-1}(u))\, A'(h,F)(u)\,du = 1.$$

Integrating by parts gives that

$$-\int_0^1 A(h,F)(u)\, d\!\left(f(F^{-1}(u))\right) = 1.$$

Now assume without loss of generality that I(F) = 1 and recall (see, e.g., Huber, 1972)
that the asymptotic variance of an L estimator is given by $\sigma_L^2 = \int_0^1 A^2(h,F)(u)\,du$. Then
the ARE of the L estimator h(u) on data from F is given by

$$\mathrm{ARE}_L(h \mid F) = \frac{1}{\sigma_L^2} = \frac{\left[\int_0^1 A(h,F)(u)\, d\!\left(f(F^{-1}(u))\right)\right]^2}{\int_0^1 A^2(h,F)(u)\,du} = \frac{\left[\int_0^1 A(h,F)(u)\, \frac{f'}{f}(F^{-1}(u))\,du\right]^2}{\int_0^1 A^2(h,F)(u)\,du \,\int_0^1 \left[\frac{f'}{f}(F^{-1}(u))\right]^2 du},$$

which was to be shown.

When $h(u) = (-g'/g)'(G^{-1}(u))$, so that it is the optimal weight function for the
distribution G, the ARE of it on data from F is:

$$\mathrm{ARE}_L(G \mid F) = \rho^2\!\left[A\!\left((-g'/g)'(G^{-1}(u)),\, F\right)(u),\; -\frac{f'}{f}(F^{-1}(u))\right].$$

If we let $h_f$ denote the efficient weight function for F, we have

$$A(h_f,F)(u) = \int_{1/2}^{u} \frac{h_f(t)}{f(F^{-1}(t))}\,dt = \int_{1/2}^{u} \frac{d}{dt}\!\left(-\frac{f'}{f}(F^{-1}(t))\right) dt = -\frac{f'}{f}(F^{-1}(u)).$$

If we now let $h_g$ denote the efficient weight function for G, equation (5.4) can be
expressed more symmetrically as

$$\mathrm{ARE}_L(G \mid F) = \rho^2\!\left(A(h_g,F)(u),\, A(h_f,F)(u)\right).$$

Like M-estimators, L-estimators do not have the reflexivity of efficiency possessed by
R-estimators. For example, the mean, which is the efficient L-estimator (as well as
M-estimator) for the Gaussian distribution, has an ARE of 50% when used on double exponential
data; while the median, which is the efficient L estimator (as well as R-estimator) for
the double exponential distribution, has an ARE of 64% when used on Gaussian data. The
corresponding R-estimators have ARE equal to 64% in both cases.
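The two L-estimator efficiencies just quoted can be reproduced from the squared correlation in (5.4). The sketch below rests on two simplifications that are assumptions of this example rather than statements from the text: for the mean, h(u) = 1, so A(h,F)(u) reduces to the centered quantile function of the data; for the median, the weight function is treated as a point mass at u = 1/2, so A(h,F)(u) is proportional to sgn(u - 1/2). The function names and grid size are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def laplace_quantile(u):
    """Quantile function of the unit-scale double exponential."""
    return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))


def sign_score(u):
    """sgn(u - 1/2): the double exponential score, and A(h, F) for the median."""
    return 1.0 if u > 0.5 else -1.0


def normal_quantile(u):
    """Gaussian quantile, which is also the Gaussian score -f'/f(F^{-1}(u))."""
    return _STD_NORMAL.inv_cdf(u)


def squared_correlation(a, b, n=100_000):
    """Squared correlation of two mean-zero functions on (0, 1)."""
    s_ab = s_aa = s_bb = 0.0
    for i in range(n):
        u = (i + 0.5) / n  # n even, so u never equals 1/2 exactly
        x, y = a(u), b(u)
        s_ab += x * y
        s_aa += x * x
        s_bb += y * y
    return s_ab * s_ab / (s_aa * s_bb)


# Mean on double exponential data: A(h, F) is the Laplace quantile function.
are_mean_on_laplace = squared_correlation(laplace_quantile, sign_score)
# Median on Gaussian data: A(h, F) is proportional to sgn(u - 1/2).
are_median_on_gaussian = squared_correlation(sign_score, normal_quantile)
```

The first value is close to 1/2 and the second to $2/\pi$, in agreement with the 50% and 64% quoted above.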

Relationships 

It is useful to emphasize the similarities and differences among the correlation
coefficients for the three classes of estimators. Repeating the ARE formulas derived above
we have:

$$\mathrm{ARE}_R(G \mid F) = \rho^2\!\left[-\frac{g'}{g}(G^{-1}(u)),\, -\frac{f'}{f}(F^{-1}(u))\right],$$

$$\mathrm{ARE}_M(G \mid F) = \rho^2\!\left[-\frac{g'}{g}(F^{-1}(u)),\, -\frac{f'}{f}(F^{-1}(u))\right], \text{ and}$$

$$\mathrm{ARE}_L(G \mid F) = \rho^2\!\left[A(h_g,F)(u),\, -\frac{f'}{f}(F^{-1}(u))\right].$$

Note that in R estimation asymptotic relative efficiency is determined by how well
f'/f and g'/g correlate when each is evaluated at its own data. In contrast, for M
estimation, asymptotic relative efficiency is determined by how well f'/f and g'/g
correlate when both are evaluated at the actual data, $F^{-1}(u)$. For L estimation the
situation is different still. In this case linear approximations to f'/f and g'/g are
evaluated at their own data and then smoothed by the distribution of the actual data.
Asymptotic relative efficiency for L depends upon how well these two smoothed versions
correlate.

A  Further  Connection 

An obvious similarity among the correlation coefficients is the presence of
$-f'/f(F^{-1}(u))$ in each of them when the data are from the distribution F. Thus the
correlations may all be interpreted as being with the optimal rank score function for the
actual data. We can couple this interpretation with the result of Hajek (1962) that for
any rank score function one can find a corresponding distribution whose amprt has that
score function. With appropriate conditions on $\psi$ and h it would thus be possible to
find distributions $G_{\psi,F}$ and $G_{h,F}$ with $I(G_{\psi,F}) = I(G_{h,F}) = 1$ such that

$$-\frac{g'_{\psi,F}}{g_{\psi,F}}\!\left(G_{\psi,F}^{-1}(u)\right) = \frac{\psi(F^{-1}(u))}{\left[\int_0^1 \psi^2(F^{-1}(u))\,du\right]^{1/2}}$$

and

$$-\frac{g'_{h,F}}{g_{h,F}}\!\left(G_{h,F}^{-1}(u)\right) = \frac{A(h,F)(u)}{\left[\int_0^1 A^2(h,F)(u)\,du\right]^{1/2}}.$$

That is, the functions

$$\frac{\psi(F^{-1}(u))}{\left[\int_0^1 \psi^2(F^{-1}(u))\,du\right]^{1/2}} \quad \text{and} \quad \frac{A(h,F)(u)}{\left[\int_0^1 A^2(h,F)(u)\,du\right]^{1/2}}$$

would correspond to the score functions of the amprt's for the distributions $G_{\psi,F}$ and
$G_{h,F}$, respectively. Thus the score functions for rank estimators and their correlations
contain information concerning not only the efficiencies of rank estimators but also
implicitly the efficiencies of M and L estimators. This approach might conceivably be
useful in extending the interpretation of some of the results presented in Joiner and Hall
(1979).


6.  Conclusions


This  paper  has  emphasized  the  important  role  played  by  f'/f  in  determining  the 
efficiency  of  all  three  major  classes  of  location  estimators:  L,  M,  and  R.  In  all 
three,  f'/f  is  used  to  define  the  asymptotically  efficient  estimator  and  a  heuristic  view 
is  given  as  to  why  f'/f  might  be  an  intuitively  reasonable  quantity  on  which  to  base 
location  estimation.  The  asymptotic  relative  efficiency  of  each  of  the  three  classes  of 
estimators  is  seen  to  depend  upon  the  degree  of  agreement  between  f'/f  of  the 
hypothesized distribution and the corresponding quantity g'/g for the distribution which
actually  generated  the  data. 

The  insight  gained  in  this  paper  is  used,  in  several  companion  papers,  to  develop 
other  results  useful  in  robust  estimation.  In  Hall  and  Joiner  (1980b)  a  number  of 
numerical  and  analytical  results  are  given  for  the  asymptotic  relative  efficiencies  of  R 
estimators  optimal  for  some  distribution  F  when  applied  to  data  from  some  other 
distribution  G.  Then  in  Hall  and  Joiner  (1980c)  the  R  efficiencies  are  used  to  develop 
several useful low dimensional representations of the space of distributions. Underway is
a quantitative comparison of the relative efficiencies among the three classes of estimators
(Joiner and Hall, 1980d). This comparison was prompted by the relationship noted here that
in the correlation coefficient that determines R efficiency, f'/f and g'/g are both
evaluated at their own data, while for M and L estimation the hypothesized estimator
is, at least in part, evaluated at the actual data. This suggests the possibility of some
general efficiency robustness for R estimators. However, Bishop's result (cited in
Exhibit 5A) shows that there exist pairs of distributions and choices of scaling functions
for which both M estimators (i.e., for F on G data and for G on F data) have
better ARE than their R counterparts.

In still another related paper (Joiner, Hall and Bishop, 1980e) the close relationship
between the defining equations for L, M and R estimators is used to extend these
results to the general linear model. This extension is equivalent to what Bickel (1978)
has called "pseudo observations" in the context of M estimators.